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The objectives of this Perspective paper are to review 
some recent advances in sparse feature selection for 
regression and classification, as well as compressed 
sensing, and to discuss how these might be used to 
develop tools to advance personalized cancer therapy. 
As an illustration of the possibilities, a new algorithm 
for sparse regression is presented and is applied to 
predict the time to tumour recurrence in ovarian 
cancer. A new algorithm for sparse feature selection in 
classification problems is presented, and its validation 
in endometrial cancer is briefly discussed. Some open 
problems are also presented. 



1. Introduction 

The objectives of this Perspective paper are to review 
some recent advances in sparse feature selection for 
regression and classification, and to discuss how these 
might be used in the computational biology of cancer. 
One of the motivations for writing this paper is to present 
a broad picture of some recent advances in machine 
learning to the more mathematically inclined within the 
cancer biologist community, and to apply some of these 
techniques to a couple of problems. Full expositions 
of these applications will be presented elsewhere. In 
the other direction, it is hoped that the paper will 
also facilitate the entry of interested researchers from 
the machine learning community into cancer biology 
In order to understand the computational aspects of the 
problems described here, a basic grasp of molecular 
biology is sufficient, as can be obtained from standard 
references, for example Northrop & Connor [1] and 
Tozeren & Byers [2]. 
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Cancer is the second leading cause of death in the USA [3]. It is estimated that in the USA in 
2013 there will be 1 660290 new cases of cancer in all sites, and 589350 deaths [4]. In the UK, in 
2011 there were 331487 cases of cancer, and 159 178 deaths; both are the latest figures available 
[5]. Worldwide, cancer led to about 7.6 million deaths in 2008 [6]. It is interesting to note that, 
whether in developed countries such as the USA and the UK or worldwide, cancer accounts for 
roughly 13% of all deaths [6]. 

One of the major challenges faced by cancer researchers is that no two manifestations of cancer 
are alike, even when they occur in the same site. One can paraphrase the opening sentence of Leo 
Tolstoy's Anna Karenina and say that 'Normal cells are all alike. Every malignant cell is malignant 
in its own way/ Thus, cancer would be an ideal target for 'personalized medicine', in which 
therapy is custom-tailored to each patient. Unfortunately, our current level of understanding 
of the disease does not permit us to develop truly personalized therapies for every individual 
patient. Therefore, it is necessary to settle for an intermediate approach, which might be described 
as 'patient stratification'. In this approach, diverse manifestations of a particular type of cancer 
are grouped into a small number of classes, wherein the manifestations are broadly similar within 
each class and substantially different between classes. Then attempts can be made to develop 
therapeutic regimens that are tailored for each class. 

Until recently, grouping of cancers has been attempted first through the site of the cancer, and 
then through histological considerations, that is, the microscopic anatomy of the cells comprising 
the tumour, and other parameters that can be measured by a physical examination of the tumour. 
For example, lung cancer is divided into two broad categories, namely small-cell lung cancer 
(SCLC) and non-small-cell lung cancer (NSCLC), where the prognosis for the latter is decidedly 
better than for the former. Then, NSCLC is divided into three subtypes known as adenocarcinoma, 
squamous cell carcinoma and large-cell carcinoma. All of these subtypes are defined on the basis 
of histology. But this is not the only possible approach. It is also possible to define the subtypes 
on the basis of the molecular-level properties of the cancer tumour. For instance, there are four 
major types of breast cancer, known as luminal A, luminal B, non-luminal and basal type. These 
subtypes are defined based on the expression levels of the genes oestrogen receptor, progesterone 
receptor and HER2, also known as ERBB2, being either high or low. The basal-like subtype, also 
known as the triple negative subtype owing to the fact that all three genes are expressed at very 
low levels, constitutes about 20% of breast cancer cases and has the worst prognosis. For the 
other three subtypes, there are some proved therapies that work reasonably well; but this is not 
so for triple negative subtypes. The above subtyping illustrates the type of challenges faced by 
a mathematically trained person when studying computational biology. For instance, given that 
there are three genes being studied, and that the expression level of each can be either high or 
low, a mathematician/ engineer might think that there are 2 3 = 8 possible subtypes. In reality 
however, as stated above, there are only four subtypes, and some of the possible combinations 
do not seem to occur sufficiently frequently. 1 The therapies for the various subtypes are quite 
different. Therefore, it is important to be able to ascertain to which subtype a patient belongs, 
before commencing therapy. This is one possible application of machine learning. 

During the past decade, attempts have been made to collect the experimental data generated 
by various research laboratories into central repositories such as the Gene Expression Omnibus 
[8] and the Catalogue of Somatic Mutations in Cancer (COSMIC) [9]. However, the data in these 
repositories are often collected under widely varying experimental conditions. Moreover, the 
standard of reporting and documentation is not always uniform. To mitigate this problem, there 
are now some massive public projects underway for generating vast amounts of data for all the 
tumours that are available in various tumour banks, using standardized sets of experimental 
protocols. Among the most ambitious are The Cancer Genome Atlas, usually referred to by the 
acronym TCGA [10], which is undertaken by the National Cancer Institute, and the International 
Cancer Genome Consortium, referred to also as ICGC [11], which is a multi-country effort. 

^ee Malhotra et al. [7] for a more refined partitioning into six subtypes. However, the refined subtyping involves other genes, 
not only these three. 



In the TCGA data, molecular measurements are available for almost all tumours, and clinical 
annotations are also available for many tumours. 

With such a wealth of data becoming freely available, researchers in the machine learning 
community can now aspire to make useful contributions to cancer biology without the need to 
undertake any experimentation themselves. However, in order to carry out meaningful research, 
it is essential to have a close collaboration with one or more biologists. The style of exposition in 
the biological literature is quite different from that in mathematical books and papers, and the 
author's experience has been that simply conversing with expert biologists is the fastest way to 
become familiar with the subject. 2 Unlike in mathematics, in biology it is not possible to derive 
everything from a few fundamental axioms and /or principles; instead, one is confronted with 
a bewildering variety of terms and names, all of which have to be mastered (memorized?) in 
parallel. One example, as mentioned above in connection with breast cancer, is that the names 
ERBB2, HER2 and HER2/Neu all refer to the same gene. Also, while it is not necessary to 
perform experiments oneself, it is absolutely crucial to understand the nature of the experiments, 
so that one is aware of the potential sources of error and the level of reliability of specific types of 
molecular measurements. 

Owing to space limitations, in this paper only two out of the many possible applications of 
machine learning to cancer are addressed, namely sparse regression and sparse classification. 
Other topics such as network inference and modelling tumour growth are mentioned very briefly 
in passing towards the end of the paper. 

Now, we briefly state the class of problems under discussion in this paper. This also serves 
to define the notation used throughout. Let m denote the number of tumour samples that are 
analysed, and let n denote the number of attributes, referred to as 'features', that are measured 
on each sample. Typically, m is of the order of a few dozen in small studies, ranging up to several 
hundreds for large studies such as the TCGA studies, while n is of the order of tens of thousands. 
There are 20 000 or so genes in the human body, and in whole genome studies, and the expression 
level of each gene is measured by at least one 'probe', and sometimes by more than one. The 'raw' 
expression level of a gene corresponds to the amount of messenger RNA that is produced and 
is therefore a non-negative number. However, the raw value is often transformed by taking the 
logarithm after dividing by a reference value, subtracting a median value, dividing by a scaling 
constant and the like. As a result, the numbers that are reported as gene expression levels can 
sometimes be negative numbers. Therefore, it is best to think of gene expression levels as real 
numbers. Other features that are measured include micro-RNA (miRNA) levels, methylation 
levels and copy number variations, all of which can be thought of as real-valued. There are also 
binary features such as the presence or absence of a mutation in a specific gene. In addition 
to these molecular attributes, there are also 'labels' associated with each tumour. Let yi denote 
the label of tumour i, and note that the label depends only on the sample index i and not the 
feature index j. Typical real-valued labels include the time of overall survival after surgery, time 
to tumour recurrence or the lethality of a drug on a cancer cell line. Typical binary labels include 
whether a patient had metastasis (cancer spreading beyond the original site). In addition, it is 
also possible for labels to be ordinal variables, such as 'poor responder', 'medium responder' and 
'good responder'. Often these ordinal labels are merely quantized versions of some other real- 
valued attributes. For instance, the previous example corresponds to a three-level quantization of 
the time to tumour recurrence. In general, the labels refer to clinical outcomes, as in all of the above 
examples. Usually, each sample has multiple labels associated with it. However, in applications, 
the labels are treated one at a time, so it is assumed that there is only one label for each sample, 
with yi denoting the label of the zth sample. Moreover, for simplicity, it is assumed that the labels 
are either real-valued or binary. 

Thus, the measurement set can be thought of as an m x n matrix X = [xij], where X(j is the 
value of feature j in sample i. The row vector x l , denoting the zth row of the matrix X, is called the 

2 My biology collaborator, Prof. Michael A. White of the UT Southwestern Medical Center, says that the same is true in the 
opposite direction as well. He and his students find it easier to understand algorithms by just talking to me and my students. 
Perhaps there is a lesson in that. 



feature vector associated with sample i. Similarly, the column vector xj denotes the variation of 
the ;th feature across all m samples. Throughout this paper, it is assumed that X e R mxn , that is, 
that each measurement is a real number. Binary measurements such as the presence or absence of 
mutations are usually handled by partitioning the sample set into two groups, corresponding to 
the two labels. For the purposes of incorporating binary labels into numerical computation, the 
labels are taken as ±1, the so-called bipolar case. It does not matter which abstract label is mapped 
into +1 and which abstract label is mapped into — 1. If y z is bipolar, the associated problem is called 
'classification', whereas if y z - is real the associated problem is called 'regression'. In either case, 
the objective is to find a function/ : R n -> R or/ : R n -> {—1, 1} such that y\ is well approximated 

by/(x ! '). 



2. Regression methods 

The focus in this section is on the case where the label yi is a real number. Therefore, the objective 
is to find a function/: R n -> R such that/(x z ) is a good approximation of yi for all i. A typical 
application in cancer biology would be the prediction of the time for a tumour to recur after 
surgery. The data would consist of expression levels of tens of thousands of genes on around 
a hundred or so tumours, together with the time for the tumour to recur for each patient. The 
objective is to identify a small number of genes whose expression values would lead to a reliable 
prediction of the recurrence time. Cancer is a complex, multi-genic disease, and identifying a 
small set of genes that appear to be highly predictive in a particular form of cancer would be very 
useful. Explaining why these genes are the key genes would require constructing gene regulatory 
networks (GRNs). While this problem is also amenable to treatment using statistical methods, it 
is beyond the scope of this paper. Towards the end of this section, the tumour recurrence problem 
is studied using a new regression method. 



(a) Some well-established algorithms 

Throughout this section, attention is focused on linear regressors, with/(x) = xw — 0, where w e R n 
is a weight vector and 6 e R is a threshold or bias. There are several reasons for restricting 
attention to linear regressors. From a mathematical standpoint, linear regressors are by far the 
most widely studied and the best understood class of regressors. From a biological standpoint, 
it makes sense to suppose that the measured outcome is a weighted linear combination of 
each feature, with perhaps some offset term. If one were to use higher order polynomials, for 
example, then biologists would rightly object that taking the product of two features (say two 
gene expression values) is unrealistic most of the time. 3 Other possibilities include pre-processing 
each feature Xjj through a function such as e x /(l + e x ), but this is still linear regression in 
terms of the processed values. As explained earlier, often the measured feature values are 
themselves processed values of the corresponding 'raw' measurements. 

In traditional least-squares regression, the objective is to choose a weight vector w e R n and a 
threshold 0 so as to minimize the least-squared error 

m 
i=l 

This method goes back to Legendre and Gauss and is the staple of researchers everywhere. Let e 
denote a column vector of all ones, with the subscript denoting the dimension. Then, 

J LS =\\Xw-6e m -y\\l=\\Xw-y\\l 



3 There are situations, such as transcription factor genes regulating other genes, where taking such a product would be 
realistic. But such situations are relatively rare. 
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If the matrix X has full column rank of n + 1, then it is easy to see that the unique optimal choice 
is given by 

-l 

wi s = (x t xy 1 x t y = 



x l x 

-elX 



m 



In the present context, the fact that m <n ensures that the matrix X has rank less than ft, whence 
the matrix X has rank less than n + 1. As a result, the standard least-squares regression problem 
does not have a unique solution. Therefore, one attempts to minimize the least-squares error while 
imposing various constraints (or penalties) on the weight vector w. 4 Different constraints lead to 
different problem formulations. An excellent and very detailed treatment of the various topics of 
this section can be found in Hastie et al. [12, ch. 3]. 

Suppose we minimize the least-squared error objective function subject to an l2-rvorvcv 
constraint on w. This approach to finding a unique set of weights is known as 'ridge regression' 
and is usually credited to Hoerl & Kennard [13]. However, several of the key ideas are found in a 
much earlier paper by the Russian mathematician Tikhonov [14]. In ridge regression, the problem 
is reformulated as 



min ^2(x l w ■ 

i=l 



-\jif s.t. \\w\\i<t, 



where t is some prespecified bound. In the associated Lagrangian formulation, the problem 
becomes one of the minimizing objective function 

m 

/ridge := YM W ~ 6 ~ + X W W \$> < 2 ' 2 ) 
i=l 

where k is the Lagrange multiplier. Because of the additional term, the (1, l)-block of the Hessian 
of /ridge/ which is the Hessian of / r idge with respect to w, now equals Xl n + X l X, which is 
positive definite even when m < ft. Therefore, the overall Hessian matrix is positive definite 
under a mild technical condition, and the problem has a unique solution for every value of the 
Lagrange parameter X. However, the major disadvantage of ridge regression is that, in general, 
every component of the optimal weight vector w r ^ e is non-zero. In the context of biological 
applications, this means that the regression function makes use of every feature Xj, which is in 
general undesirable. 

Another possibility is to choose a solution w that has the fewest number of non-zero 
components, that is, a regressor that uses the fewest number of features. Define 

/ n \ VP 

lfp>l, this is the familiar £ p -norm. If p < 1, this quantity is no longer a norm, as the function w \-> 
|| w ||p is no longer convex. However, as p | 0, the quantity \\w\\ v approaches the number of non-zero 
components of w. For this reason, it is common to refer to the number of non-zero components of 
a vector as its 'lo-norm' even though || • ||o is not a norm at all. Moreover, it is known [15] that the 
problem of minimizing ||w||o is NP-hard. 



Note that no penalty is imposed on the threshold 9. 



A very general formulation of the regression problem is to minimize 

m 

Jm := YM w -°- y^ 2 + n ^> ( 2 - 3 ) 

i=l 

where 1Z : R w -> R + is a norm known as the 'regularized. This problem is analysed at a very high 
level of generality in Negabhan et ah [16], where the least-squares error term is replaced by an 
arbitrary convex 'loss' function. In the interests of simplicity, we do not discuss the results of 
Negabhan et ah [16] in their full generality and restrict the discussion to the case where the loss 
function is quadratic as in (2.3). 

In Tibshirani [17], it is proposed to minimize the least-squared error objective function subject 
to an €i-norm constraint on the weight vector w. In Lagrangian formulation, the problem is to 
minimize 

m 

/lasso := ~°- Vi) 1 + ( 2 - 4 ) 

i=l 

where X is the Lagrange multiplier. The acronym 'LASSO' is coined in Tibshirani [17] and stands 
for 'least absolute shrinkage and selection operator'. The LASSO penalty can be rationalized by 
observing that || • ||i is the convex relaxation of the '^o-norm'. The behaviour of the solution to 
the LASSO algorithm depends on the choice of the upper bound t. A detailed analysis of the 
Lagrangian formulation (2.4) and its dual problem is carried out in Osborne et ah [18]. It is shown 
there that, if the Lagrange multiplier X in (2.4) is sufficiently large, say X > X max , then the only 
solution to the LASSO minimization problem is w = 0. Moreover, the threshold A max is not easy to 
estimate a priori. An optimal solution is defined to be 'regular' in Osborne et ah [18, definition 3.3] 
if it satisfies some technical conditions. In every problem, there is at least one regular solution. 
Moreover, every regular optimal weight vector has at most m non-zero entries (see Osborne et ah 
[18, theorem 3.5]). 

In many applications, some of the columns of the matrix X are highly correlated. For instance, 
if the indices j and k correspond to two genes that are in the same biological pathway, then 
their expression levels would vary in tandem across all samples. Therefore, the column vectors xj 
and Xfc would be highly correlated. In such a case, ridge regression tends to assign nearly equal 
weights to each. At the other extreme, LASSO tends to choose just one among the many correlated 
columns and to discard the rest; which one gets chosen is often a function of the 'noise' in the 
measurements. In biological datasets, it is reasonable to expect that expression levels of genes that 
are in a common pathway are highly correlated. In such a situation, it is undesirable to choose just 
one among these genes and to discard the rest; it is also undesirable to choose all of them, as that 
would lead to too many features being chosen. It would be desirable to choose more than one, but 
not all, of the correlated columns. This is achieved by the so-called 'elastic net' (EN) algorithm, 
introduced in Zou & Hastie [19], which is a variation of the LASSO algorithm. In this algorithm, 
the penalty aims to constrain, not the €i-norm of the weight w, but a weighted sum of its €i-norm 
and 12-vloxvcy squared. The problem formulation in this case, in Lagrangian form, is to choose w 
so as to minimize 

n 

/en := YM w - 6 - y^ 2 + + a - n)w\ii (^) 

i=l 

where fie (0,1). Note that if /x = 0, then the EN algorithm becomes the LASSO, whereas with 
li = l, the EN algorithm becomes ridge regression. Thus, the EN algorithm provides a bridge 
between the two. Note that the penalty term in the EN algorithm is not a norm, owing to the 
presence of the squared term; hence, the EN algorithm is not covered by the very thorough 
analysis in Negabhan et ah [16]. A useful property of the EN algorithm is brought out in Zou 
& Hastie [19, theorem 1]. 

Theorem 2.1. Assume that y,X,\ are fixed, and let w denote the corresponding minimizer of (2.5). 
Assume without loss of generality that y is centred, that is, y t e m = 0, and that the columns of X 



are normalized such that ||xy||2 = l for all ]. Let j,k be two indices between 1 and n, and suppose 
that xjjfc > 0. Then, 

\Wj - w k \ < l ^^2(l-x]x k ). (2.6) 

As one can always ensure that xjxJJ. > 0 by replacing x k by — x k if necessary, (2.6) states that if the 
columns Xj and x k are highly correlated, then the corresponding coefficients in the regressor are 
nearly equal. Unlike in the LASSO algorithm, there do not seem to be many results on the number 
of non-zero weights that are chosen by the EN algorithm. It can and often does happen that the 
number of features chosen is larger than m, the number of samples. However, as explained above, 
this is often seen as a desirable feature when the columns of the matrix X are highly correlated, as 
they often are in biology datasets. 

By now both LASSO and EN can be viewed as well-established algorithms. A search of the 
Pubmed database of the National Library of Medicine with strings 'LASSO cancer' or 'EN cancer' 
results in about 200 entries for the former and several dozen entries for the latter. Note that 
these numbers are an order of magnitude less than the corresponding numbers for the support 
vector machine (SVM), discussed in %2b. Many of the papers citing the LASSO algorithm do not 
directly apply the algorithm to cancer data; instead, they propose some variant of the algorithm 
and claim to show that their variant outperforms the standard LASSO algorithm. A surprisingly 
large number of these variants propose non-convex objective functions (such as the '£p-norm' with 
p < 1). Given that, in convex optimization, every local optimum is also a global optimum, whereas 
this is not so in the case of non-convex optimization, it is difficult to imagine what benefits if 
any are conferred by replacing the convex objective function /lasso with a non-convex objective 
function. But there are many such papers to be found in the literature. In the case of the EN 
algorithm, a typical application is found in Lee et al. [20] that addresses the problem of identifying 
some genes to delineate advanced versus early stage colorectal cancer. In this study, 1192 known 
or putative cancer genes found from Network [21] and COSMIC [9] constitute the feature set 
on 197 samples. As expected, the EN algorithm chooses a large number of features, which are 
then rank-ordered to determine the key genes. An interesting paper [22] compares all the three 
methods discussed here, namely ridge regression, LASSO and EN, on several datasets both 
synthetic and real, including a lung adenocarcinoma dataset. Not surprisingly, ridge regression 
assigns a non-zero weight to all 1310 features, whereas EN assigns zero weights to only 43 
features, thus resulting in no significant reduction in the number of features chosen. The paper 
does not clearly mention how many features are retained by the LASSO algorithm. 

(b) Some recent algorithms and open problems 

Next, we discuss several versions of the problem formulation in (2.3) corresponding to diverse 
choices of the penalty norm 1Z, culminating in some open problems that are relevant to biological 
applications. The 'pure' LASSO algorithm tries to choose as few distinct features as possible in 
the regressor. However, it may be worthwhile to partition the set of features M = {1, . . . , n] into g 
groups Gi, . . . , G g/ and then choose a regressor that selects elements from as few distinct groups 
as possible, without worrying about the number of features chosen. This is achieved by the so- 
called group LASSO (GL) algorithm introduced in Bakin [23] and Lin & Zhang [24]. Let n\ := |G/| 
for / = 1, . . . ,g. In the grouped LASSO algorithm, the objective function is 

m g 

jgl = YM w - 6 - y^ 2 + A E Vni\\w Gl ii 2 , (2.7) 
i=i i=i 

where wq e M n is determined from w by setting zvj = 0 for all )^G\. It is clear that, depending on 
the relative sizes of the various groups, one weight vector can have more non-zero components 
than another, and yet the number of distinct groups to which these non-zero components belong 
can be smaller. In the limiting case, if the number of groups is taken as n and each group is taken 
to consist of a singleton set, then the grouped LASSO reduces to the standard LASSO algorithm. 



Figure 1. Two regulatory networks, (a) A network without overlapping groups, [b) A network with overlapping groups. 



A further variation is the so-called sparse GL (SGL) algorithm introduced in Friedman et al. [25] 
and Simon et al. [26], where the objective is simultaneously to choose features from as few distinct 
groups as possible, and within the chosen groups, choose as few features as possible. The objective 
function in the SGL algorithm is 

m g 

/SGL = YM W ~ 6 ~ Vi? + X " riW w Gi 111 + Mll^Q 112], (2.8) 

i=l 1=1 

where as always \i e [0, 1]. 

The above formulations of the GL and SGL norms are based on the assumption that the 
various groups do not overlap. However, in some biological applications it makes sense to permit 
overlapping group decompositions. Specifically, at a first level of approximation a GRN can 
be modelled as a directed acyclic graph, wherein the root nodes can be interpreted as master 
regulator genes, and directed paths can be interpreted as biological pathways. In such a case, one 
seeks to explain the available data, not by choosing the fewest number of genes but rather by the 
fewest number of pathways. To illustrate, consider the baby example shown in figure 1, where 
gene 1 is a master regulator, while genes 2-7 are regulated genes. Some are regulated directly 
by a master regulator gene, whereas others are indirectly regulated. In figure la, there are four 
pathways, namely 

Gi = {1,2,4}, G 2 = {1,2,5}, G 3 = {1,3,6} and G 4 = {1,3,7}, 

whereas in figure lb there are also four pathways, namely 

Gi = {1,2,4}, G 2 = {1,2,5}, G 3 = {1,3,5} and G 4 = {1,3,6}. 

Ideally, we would like to choose a set of features that intersect with as few pathways as possible. 
We will return to this example after presenting available theories for sparse regression with 
overlapping groups. 

To date, various versions of group or SGL with overlapping groups have been proposed. As 
before, let G\, . . . , Gg be subsets of AT = {1, ... , ft}, but now without the assumption that the groups 
are pairwise disjoint. The penalty-augmented optimization problems are the same as in (2.7) and 
(2.8), respectively; however, the objective functions are now referred to as Jglo an d /sglo to 
suggest (sparse) GL with overlap. For the case of overlapping groups, the theory developed in 
Negabhan et al. [16] continues to apply so long as the penalty terms in (2.7) and (2.8), respectively, 
are 'decomposable'. The most general results available to date address the case where the groups 
are 'tree structured', that is, 

d {Gi c Gj or Gj c GJ. (2.9) 

See, for example, Obozinski et al. [27] and Jenetton et al. [28]. 

Now, if we examine the groups associated with the network in figure la, it is obvious that (2.9) 
is not satisfied. However, there is a slight modification that would permit (2.9) to hold, namely to 
drop the root node and retain only the successors. Thus, the various groups are 

Gi = {4}, G 2 = {5}, G 3 = {6}, G 4 = {7}, G 5 = {2,4}, 
G 6 = {2,5}, G 7 = {3,6} and G 8 = {3,7}. 



However, there is no way of modifying the groups so as to ensure that (2.9) holds for the network 
in figure lb. The reason is easy to see. The 'tree structure' assumption (2.9) implies that there 
is only one path between every pair of nodes. But this is clearly not true in figure lb, because 
there are two distinct paths from node 1 to node 5. Moreover, a little thought would reveal that 
that the assumption of tree-structured groups does not really permit truly overlapping groups. 
In particular, if (2.9) holds, then the collection of sets {G\, . . . , G g ] can be expressed as a union of 
chains in the form 

Gn ^ • • • Q Gigu • • • / G s i c . . . c G sgs , 

where the 'maximal' sets Gj gj are pairwise disjoint once duplicates are removed, and together span 
the total feature set J\f = {1, . . . ,n}. Now, in a biological network, it makes no sense to impose a 
condition that there must be only one path between every pair of nodes. Therefore, the problem of 
defining a decomposable norm penalty for inducing other types of sparsity besides tree structure, 
especially the types of sparsity that are consistent with biology, is still open. 

We conclude this section with a new algorithm and its application to sparse regression. This 
represents joint work with Mehmet Eren Ahsen and will be presented in more complete form 
elsewhere. A special case of SGL is obtained by choosing just one group, which perforce has to 
equal J\f, so that 

m 

/men = YM w ~°~ y {)1 + _ + ^ww- 

i=l 

Of course, as the entire index set AT is chosen as one group, there is nothing 'sparse' about it. Note 
that the only difference between (2.10) and (2.5) is that the i2~ n orm is not squared in the former. 
For this reason, the above approach is called the 'modified elastic net' or MEN algorithm. Unlike 
in EN, the penalty (or constraint) term in MEN is a norm, being a convex combination of the l\- 
and £2 -n o rm s. In several examples, the MEN algorithm appears to combine the accuracy of EN 
with the sparsity of LASSO. It is relatively easy to prove an analogue of theorem 2.1 for the MEN 
algorithm. That is, unlike in LASSO but as in EN, MEN assigns nearly equal weights to highly 
correlated features. But further theoretical analysis remains to be carried out. 

The MEN algorithm was applied to the TCGA ovarian cancer data [29] to predict the time 
to tumour recurrence. Specifically, both times to tumour recurrence as well as expression levels 
for 12 042 genes are available for 283 patients. Out of these, 40 patients whose tumours recurred 
before 210 days or after 1095 days were excluded from the study as being 'extreme' cases. The 
remaining 243 samples were analysed using MEN with recursive feature elimination (RFE). The 
results are shown in figure 2. The number of features and the average percentage error in absolute 
value are shown in table 1. 

3. Compressed sensing 

In recent years, there have been several results that are grouped under the general heading of 
'compressed sensing' or 'compressive sensing'. Both expressions are in use, but 'compressed 
sensing' is used in this paper. The problem can be roughly stated as follows: suppose xeR" 
is an unknown vector but with known structure; is it possible to determine x either exactly 
or approximately by taking m^n linear measurements of x? The area of research that goes 
under this broad heading grew spectacularly during the first decade of the new millennium. 5 
As summarized in the introduction of the paper [31], the impetus for recent work in this area was 
the desire to find algorithms for data compression that are 'universal' in the sense of being non- 
adaptive (i.e. do not depend on the data). In the original papers in this area, the results and proofs 
were a mixture of sampling, signal transformation (time domain to frequency domain and vice 
versa), randomness, etc. However, as time went on, the essential ingredients of the approach were 
identified, thus leading to a very streamlined theory that clearly transcends its original application 
domains of image and signal processing. 

5 In Davenport et al. [30], it is suggested that a precursor of compressed sensing can be found in a paper that dates back to 
1795! 





Figure 2. Predicted versus actual times to tumour recurrence in 243 ovarian cancer patients. The results for the LASSO algorithm 
are in (a), those for the EN algorithm are in (b) and those for the MEN algorithm are in (c). (Online version in colour.) 



Table 1. Comparison of three algorithms on TCGA ovarian cancer data on time to tumour recurrence, with extreme cases 
excluded. 



1 algorithm 


no. features 


average perc. error (%) 1 


LASSO 


43 


16.14 


EN 


60 


14.35 


MEN 


42 


14.91 



Image processing is one of the potential (and actually realized) applications of compressed 
sensing, and, as such, the theory has already been applied to the processing of biological images. 
Other than this, there do not appear to be any applications of the theory to cancer biology. 
Perhaps this can be attributed to the relative newness of the subject. The motivation for discussing 
compressed sensing theory in this paper is the following: whether it is in compressed sensing or 
in computational biology, one searches for a relatively simple explanation of the observations. 
Therefore, it may potentially be possible to borrow some of the basic ideas from compressed 
sensing theory and adapt them to problems in cancer biology. Compressed sensing theory as 



it currently stands cannot directly be applied to the analysis of biological datasets, because the 
fundamental assumption in compressed sensing theory is that one is able to choose the so-called 
measurement matrix, called A throughout this paper. Note that, in statistics, the matrix A is 
often referred to as the 'design' matrix. However, in biological (and other) applications, the 
measurement matrix is given, and one does not have the freedom to change it. Nevertheless, the 
developments in this area are too important to be ignored by computational biologists. The hope 
is that, by understanding the core arguments of compressed sensing theory, the computational 
biology community would be able to adapt the theory for its application domain. In parallel, 
those who are well versed in compressed sensing theory can start thinking about how the basic 
arguments can be modified to the case where the measurement matrix is specified, and cannot 
be chosen. 

The major developments in this area are generally associated with the names of Candes, 
Donoho, Romberg and Tao, though several other researchers have also made important 
contributions. See Donoho [31] for one of the earliest comprehensive papers, as well Donoho 
[32], Candes [33], Candes & Tao [34,35], Candes & Plan [36], Romberg [37] and Cohen et al [38]. 
The survey paper [30] and a recent paper [16] contain a wealth of bibliographic references that 
can be followed up by interested readers. 

We begin by introducing some notation. Suppose m, n, k are given integers, with n > 2k. For 
convenience, we denote the set {1, . . . , n] by M throughout. For a given vector x e R n , let supp(x) 
denote its support, that is, supp(x) = {i : x\ ^ 0}. Let, E k = {x e R n : |supp(x)| < k). Thus, E k denotes 
the set of '/c-sparse' vectors in R n , or, in other words, the set of n-dimensional vectors that have 
k or fewer non-zero components. For each vector x e R n , integer k < n and norm || • || on R n , the 
symbol o k (x, || • ||) denotes the distance from x to E kr that is, 



The quantity a k {x, || • ||) is called the 'sparsity measure' of the vector x of order k with respect to the 
norm || • ||. It is obvious that o k {x, || • ||) depends on the underlying norm. However, if || • || is one 
of the €p-norms, then it is easy to compute o k (x, || • ||). Specifically, given k, let vlo denote the index 
set corresponding to the /c-largest components of x in magnitude, and let x A ^ denote the vector 
that results by replacing the components of x in the set Aq by zeros. (It is convenient to think of 
xa c as an element of R n rather than an element of R n ~ k .) Then, whenever p e [1, oo], it is easy to 
see that 



Next, the so-called 'restricted isometry property' (RIP) is introduced. Note that, in some 
cases, the RIP can be replaced by a weaker property known as the 'null space property' [38]. 
However, the objective of this paper is not to present the most general results, but rather to present 
reasonably general results that are easy to explain. So the exposition below is confined to the RIP. 

Definition 3.1. Suppose A e R mxn . We say that A satisfies the RIP of order k with constant 8 k if 



So the matrix A has the RIP of order k with constant 1 — 8 k if the following property 
holds: for every choice of k or fewer columns of A (say the columns in the set / c A/", where 
l/l < k), the spectrum of the symmetric matrix A^Aj lies in the interval [1 — 8 k , 1 + 8 k ], where 
Aj eR mx '^ denotes the submatrix of A consisting of all rows and the columns corresponding 
to the indices in /. 

If integers n, k are specified, the integer m has to be sufficiently large in order for the matrix A 
to satisfy the RIP. 

Theorem 3.2 (Davenport et al. [30, theorem 1.4]). Suppose A e R mxn satisfies the RIP or order 2k 
with constant 8 2k £ (0, 1/2]. Then, 



cr k (x,\\'\\)=mi{\\x-z\\:zeE k }. 



• Hp) = 11*4; Hp. 



(1 - S k )\\u\\l < {u,Au) < (1 + 8 k )\\u\\\, Vu e E k . 



(3.1) 




(3.2) 



where 

<< 0.28. 



where 



21og(V24 + l) 

Next, we state some of the main known results in compressed sensing. The theorem statement 
below corresponds to Candes [33, theorem 1.2] and Davenport et al. [30, theorem 1.9]. 

Theorem 3.3. Suppose A e R mxn satisfies the RIP of order 82k with constant 82k < V2 — 1, and that 
y = Ax + y] for some xeW 1 and r\ e R m with \\ rj || 2 < €. Define 

x = argmin ||z||i s.t. \\y — Az\\2 < €. (3.3) 

Then, 

l|x-x|| 2 <Co^^ + C 2 ,, (3.4) 

C 0 ^2 1 + ( f- lfe and C 2 = ■ (3.5) 

1-(V2 + 1)% 1-(V2 + 1)% 

The formula for C 2 is written slightly differently from that in Davenport et al. [30, theorem 1.9] 
but is equivalent to it. 

Corollary 3.4. Suppose A e R mxn satisfies the RIP of order 82k with constant 82k < V2 — 1, and that 
y = Ax-\-r]for some xe Uk an d rj e R m with \\ r\ || 2 < e. Define 

x = argmin || z ||i s.t. \\y — Az|| 2 < e. (3.6) 

zeW 1 

Then, 

\\x-x\\ 2 <C 2 e, (3.7) 

where C 2 is defined in (3.5). 

Corollary 3.5. Suppose A e R mxn satisfies the RIP of order 82k with constant 82k < V2 — 1, and that 
y = Ax for some x e Uk- Let, A~ l (y) := {z e R n : y = Az] and define 

x = argmin ||z ||i. (3.8) 

zeA-Hy) 

Then, x = x. 

Both corollaries follow readily from the bound (3.4). Note that if x e X^, then okfx, || • ||i) = 0. 
Thus, (3.4) implies that \\x — x|| 2 < C 2 6 if there is measurement error, and ||x — x|| 2 = 0, i.e. that 
x = x, if there no measurement error. 

Corollary 3.4 is referred to as the 'near ideal' property of the LASSO algorithm. Suppose that 
x g Uk so that x is /c-sparse. Let S denote the support of x, and let As e R mx ' s ' denote the submatrix 
of A consisting of the columns corresponding to indices in S. If an 'oracle' knew not only the size 
of S, but the set S itself, then the oracle could compute x as 



Then, the error would be 



Xomc\e = (A T s A s ) 1 A T s y = x + (A T s A s ) 1 A T s r } . 



Il^oracle ~ A\l = KA^s) 1 Ajrj\\ 2 < Const • € 



for some appropriate constant. On the other hand, if x e Ek, then o>(x, || • ||i) = 0, and the right- 
hand side of (3.4) reduces to (3.7), that is, 

\\X-X\\ 2 <C 2 €. 

The point therefore is that, if the matrix A satisfies RIP, and the constant 82k satisfies the 
'compressibility condition' 82k < V2 — 1, then the mean-squared error of the solution to the 



optimization problem (3.6) is bounded by a fixed (or 'universal') constant times the error bound 
achieved by an 'oracle' that knows the support of x. 

It should be noted that there is a parallel, and closely related, set of papers that study the 
following problem: given a matrix A eR mxn , a feature vector x that is known to be /c-sparse 
but otherwise unknown, and a random measurement error w assuming values in R m , suppose 
one is given the noise-corrupted measurement y = Ax + w. To recover x from y, one solves the 
minimization problem 

x* = argmin \\y - Az\\\ + \\\z\\ lf (3.9) 

zeR n 

where I is a user-specified penalty weight. What, if any, is the relationship between x* and x? 
It is easy to see that the objective function in (3.9) is just the Lagrangian associated with the 
constrained objective function in (3.3). Specifically, if k is sufficiently large, then large values of 
||z||i are penalized, and the problem in (3.9) begins to resemble that in (3.3). Of course, the bound 
e on the magnitude of the noise is not present in the problem formulation (3.9). In Candes & Plan 
[36] and Negabhan et al. [16], the above problem is analysed, and probabilistic (with respect to 
the random noise w) bounds analogous to (3.7) are derived. Indeed, [16] contains a very general 
theory wherein the ^2 _n o rm oiy — Az is replaced by an arbitrary convex function, and the €i-norm 
is replaced by any decomposable norm. 

The advantage of the above theorem statements, which are taken from Candes [33] and 
Davenport et al. [30], is that the role of various conditions is clearly delineated. For instance, the 
construction of a matrix A e R mxn that satisfies the RIP is usually achieved by some randomized 
algorithm. In Candes & Tao [34, theorem 1.5], such a matrix is constructed by taking the columns 
of A to be samples of i.i.d. Gaussian variables. In Achlioptas [39], Bernoulli processes are used 
to construct A, which has the advantage of ensuring that all elements a^ have just three possible 
values, namely 0, +1, —1. A simple proof that the resulting matrices satisfy the RIP with high 
probability is given in Baraniuk et al. [40]. Neither of these construction methods is guaranteed 
to generate a matrix A that satisfies RIP. Rather, the resulting matrix A satisfies RIP with some 
probability, say > 1 — y\ . The probability y\ that the randomized method may fail to generate 
a suitable A matrix can be bounded using techniques that have nothing to do with the above 
theorem. Similarly, in case the measurement matrix A satisfies the RIP but the measurement 
noise -q is random, then it is obvious that theorem 3.3 holds with probability > 1 — yi, where 
Y2 is a bound on the tail probability F*r{ || 77 1| 2 > e}- Again, the problem of bounding this tail 
probability has nothing to do with theorem 3.3. By combining both estimates, it follows that if 
the measurement matrix A is generated through randomization, and if the measurement noise is 
also random, then theorem 3.3 holds with probability > 1 — y\ — yi- 

Observe that the optimization problem (3.6) is 

min||z||i s.t. \\y — Az\\2 < €. 

This raises the question as to whether the £i-norm can be replaced by some other norm || • ||p that 
induces some other form of sparsity, for example group sparsity. If some other norm is used in 
place of the ^i-norm, does the resulting algorithm display near-ideal behaviour, as does LASSO? 
In other words, is there an analogue of theorem 3.3 if || • ||i is replaced by another penalty || • \\p? 
In joint work with Ahsen [41], the author has proved a very general theorem to the following 
effect: whenever the penalty norm is 'decomposable' and the measurement matrix A satisfies a 
'group RIP', the corresponding algorithm has near-ideal behaviour provided a 'compressibility 
condition' is satisfied. The result is described in brief. 

Let Q = {Gi, . . . , Gg] be a partition of Af = {1, . . . , ft}. This implies that the sets G z - are pairwise 
disjoint. If Sc{l,...,g}, define Gs := U z - 6 sG z -. Let k be some integer such that k > max/ |G/|. 
A subset A c J\f is said to be S-group k-sparse if A = Gs and \Gs\ <k, and group k-sparse if it is 
S-group /c-sparse for some set S c {1, . . . ,g}. The symbol GkS c 2^ denotes the collection of group 
/c-sparse sets. 

Suppose || • ||p : R n -> R + is some norm. The next definition builds on an earlier definition from 
Negabhan et al. [16]. 



Definition 3.6. The norm || • \\p is decomposable with respect to the partition Q if the following 
is true: whenever u,v eR n are group /c-sparse with support sets A u c Gs x , A v <z Gs 2 and the sets 
Si, S2 are disjoint, it is true that 

\\u + v\\p = \\u\\p + \\v\\ P . (3.10) 

By adapting the arguments in Negabhan et al. [16], it can be shown that the GL norm used in 
(2.7), namely 



II*IIgl:=^>/^I|zg z II2, 
1=1 



and the SGL norm used in (2.8), namely 

g 

IMISL — X^ 1 -»)\\ x Gi\\l +Mll*G/ll2]/ 
1=1 

are both decomposable. 

Next, the notion of RIP is extended to groups. 

Definition 3.7. A matrix A e R mxn is said to satisfy the group RIP of order k with constants p fc , p^ 

if 

\\Az\\\ \\Az\\\ 

0 < p 1 < min min ^ < max max ^ < pk- (3.11) 

~ K AeGkS SUpp(z)cA || Z ||^ AeGkS SUpp(z)cA \\ z \\^ 



We define 8k := (pk — P k )/2 and introduce some constants 

. \\*a\\p , j \\*a\\p 

c := mm mm and a := max max . (3.12) 

AeGkS x A ^0 \\X A \\2 AeGkS x A ^0 \\x A \\2 

With these definitions, the following theorem can be proved. 

Theorem 3.8. Suppose A e R mxn satisfies the group RIP property of order 2k with constants (p 2k , p2k), 
respectively, and let 82k = (P2k — P2l)^' ^ u PP ose xeR n and that y = Ax + rj, where \\r}\\2 < e. Suppose 
that the norm \\ ■ \\p is decomposable, and define 

x = argmin \\z\\p s.t. \\y — Az\\2<€. (3.13) 

zeR n 

Suppose that the compressibility condition 

% < -f (3.14) 

is satisfied. Then, 

||%-x||p<-=-[2(l + r)a+fe] (3.15) 
1 — r 

and 

fx - x\\ 2 < -77^ [2(1 + r)o + f € ], (3.16) 
c(l - r) 



where o is shorthand for the sparsity index 



and 



0 = °k,g(x, II • Hp) := min \\x - x A \\ P = min \\x A c \\ P (3.17) 
AeGkS AeGkS 0 



r ..J*l and ?:= ^, (3.18) 



and c, d are defined in (3.12). 

In the above theorem, (3.14) replaces the compressibility condition 82k < V2 — 1 of theorem 3.3. 
The resemblance of (3.16) to (3.4) is obvious. Consequently, (3.16) can be readily interpreted as 
stating that minimizing the decomposable norm || • | ||p leads to near-ideal behaviour. 



Figure 3. A linearly separable dataset. (Online version in colour.) 



4. Classification methods 

The basic problem of classification can be stated as follows: suppose we are given a collection 
of labelled vectors (x t ,yj),i = \,...,m, where each x l eR n is viewed as a row vector and each 
\jl G {— 1, 1}. For future use, define 

M\ :={i:yi = l} and Mi = {i: yi = -1}. 

The objective of (two-class) classification is to find a discriminant function f : W 1 -> R such that/(x z ) 
has the same sign as yi for all i, or equivalently y z - • sign (f(x 1 )) = 1 for all i. In the present context, 
the objective is not merely to find such a discriminant function, but, rather, to find one that uses 
relatively few features. 

In many ways, classification is an easier problem than regression, because the sole criterion 
is that the discriminant function f(x l ) should have the same sign as the label yi for each i. 
Thus, if / is a discriminant function, so is af for every positive constant a, and, more generally, 
so is any function whenever 0 is a so-called 'first- and third-quadrant function', i.e. 

where (p(u) > 0 when u > 0 and (p(u) < 0 when u < 0. This gives us great latitude in choosing a 
discriminant function. 

(a) The ^-norm support vector machine 

This section is devoted to the well-known SVM, first introduced in Cortes & Vapnik [42], which 
is among the most successful and most widely used tools in machine learning. To distinguish 
this algorithm from its variants, it is referred to here as the ^2 _n o rm SVM, for reasons that will 
become apparent. 

A given set of labelled vectors {{x l ,yi),x l e R n ,y z - e {—1, 1}} is said to be linearly separable if 
there exist a 'weight vector' w e W 1 (viewed as a column vector) and a 'threshold' 0 e R such that 
f(x) = xw — 0 serves as a discriminant function. Equivalently, the dataset is linearly separable if 
there exist a weight vector w e R n and a threshold 0 e R such that 

x l w>6VieMi and x l w <0"iie Mi- 

To put it yet another way, given a weight w and a threshold 0, define H = H{w, 0) by 

H := {x e R n : xw - 0 = 0}.H+ := {x e R n : xw - 0 > 0}, H- := {x e R n : xw - 0 < 0}. 

The dataset is linearly separable if there exists a hyperplane TL such that x l e TL+Vi e M\ and 
x l eH- VieM 2 - 

The situation can be depicted as in figure 3a, in which the dots on either side of the dashed line 
represent the two classes. It is clear that linear separability is not affected by swapping the class 
labels. 

It is easy to see that, if there exists one hyperplane that separates the two classes, there exist 
infinitely many such hyperplanes. The question therefore arises as to which of these choices is 
the best. The SVM introduced in Cortes & Vapnik [42] chooses the separating hyperplane such 



Figure 4. Optimal separating hyperplane. (Online version in colour.) 



that the nearest point to the hyperplane within each class is as far as possible from it. In the original 
SVM formulation, the distance to the hyperplane is measured using the Euclidean or £2 -norm. To 
illustrate the concept, the same dataset as in figure 3 is shown again in figure 4, with the 'optimal' 
separating hyperplane, and the closest points to it within the two classes shown as hollow circles. 
In symbols, the SVM is obtained by solving the following optimization problem: 

maxmin inf ||u — x l \\. 

w,6 i veH 

An equivalent formulation of the SVM is obtained by observing that the distance of the separating 
hyperplane to the nearest points is given by c/\\ w\\, where 

c := min \yi(x l w — 0)\ = min \yi{x l w — 0)\, 

where the equality of the two terms follows from the manner in which the separating hyperplane 
is chosen. Moreover, the optimal hyperplane is invariant under scale change, that is, multiplying 
w and 6 by a positive constant. Therefore, there is no loss of generality in taking the constant c to 
equal one. With this rescaling, the problem at hand becomes the following: 

nunllwll s.t. x l w > 1 Vz e Mi and x l w < -1 Vz e Mi- (4.1) 

w 

This is the manner in which the SVM is implemented nowadays in most software packages. 

The original SVM formulation presupposes that the dataset is linearly separable. It is easy to 
determine whether or not a given dataset is linearly separable, because that is equivalent to the 
feasibility of a linear programming problem. This naturally raises the question of what is to be 
done in case the dataset is not linearly separable. One way to approach the problem is to choose 
a hyperplane that misclassifies the fewest number of points. While appealing, this approach is 
impractical, because it is known that this problem is NP-hard; see Hoffgen et al. [43] and Natarajan 
[15]. A tractable approach is to replace this problem by its convex relaxation. We will return to this 
issue when we discuss £i-norm SVMs. 

An alternative approach to guarantee that the data are linearly separable can be obtained using 
Vapnik-Chervonenkis theory [44,45]. Suppose that the n vectors x\, . . . , x n do not lie on an (p — 1)- 
dimensional hyperplane in W. In such a case, whenever p > n — 1, the dataset is linearly separable 
for every one of the 2 n ways of assigning labels to the n vectors. This result suggests that, if a given 
dataset is not linearly separable, it can be made so by increasing the dimension of the data vectors 
x l , for instance, by including not just the original components but also their higher powers. This 
is the rationale behind so-called 'higher order' SVMs, or, more generally, kernel-based classifiers 
(e.g. Cristianini & Shawe-Taylor [46] and Scholkopf & Smola [47]). 

If the norm in (4.1) is the ^-norm, then the minimization problem (4.1) is a quadratic 
programming problem, which can be solved efficiently for extremely large datasets. Moreover, 
the introduction of new data points does not alter the optimal hyperplane, unless one of the 
new data points is closer to the hyperplane than the earlier closest points. This is illustrated in 
figure 5, which contains exactly the same vectors as in figure 4, plus two more. The optimal 



Figure 5. Insensitivity of optimal separating hyperplane to additional samples. (Online version in colour.) 



hyperplane remains the same. For all these reasons, the SVM offers a very attractive approach 
to finding a classifier in situations where the number of features is smaller than the number of 
samples. On the other hand, generically the optimal weight vector has all non-zero components, 
which is undesirable when the number of samples m is too large, even if m < n. To overcome this 
problem, an approach known as RFE is suggested in Guy on et al. [48]. This consists of solving the 
SVM problem (4.1), identifying the component of the weight vector with the smallest magnitude, 
discarding it, re-solving the problem and repeating. Though it is claimed in Guyon et al. [48] that 
the method works well on a leukaemia dataset, in general RFE applied to the traditional £2 -norm 
SVM displays rather erratic behaviour. 



As we have seen, in biological applications, the number of features (the dimension of the 
vectors x l ) is a few orders of magnitude larger than the number of samples (the number of 
vectors). In such a case, because of the results in Wenocur & Dudley [44], linear separability is 
not an issue. However, in general, every component of the optimal weight vector w is non-zero. 
This means that a classifier uses every single feature in order to discriminate between the classes. 
Clearly, this is undesirable. The original SVM formulation presupposes that the dataset is linearly 
separable. If the data are not linearly separable, then as shown in Hoffgen et al. [43] and Natarajan 
[15], the problem of finding a hyperplane that misclassifies the fewest points is NP-hard. An 
alternative approach is to formulate a convex relaxation of this NP-hard problem by introducing 
slack variables into the constraints in (4.1), and then minimizing an appropriate norm of the vector 
of slack variables. Finally, in many problems, the consequences of misclassification might not 
be symmetric. A false positive (labelling a sample as positive when in fact it is negative) might 
have far more, or far less, severe consequences than a false negative. In this section, we present 
a problem formulation that addresses all of these issues. This problem formulation combines the 
ideas in two papers, namely [49,50]. 

If we choose a particular norm || • || to measure distances in 'feature space', then distances in 
'weight space' should be measured using the so-called dual norm, defined by 



In particular, if we measure distances in feature space using the €i-norm, then distances in weight 
space should be measured using its dual, which is the -norm. With this observation, the 
problem can be formulated as follows: 



(b) The £rnorm support vector machine 



w\\ d := sup \xw\. 
IWI<1 



min (1 — k) 

w,6,y,z 




Z{ + X max \wi\ s.t. 

l<i<n 



Li=l 



i=l 



(4.2) 



x l w - 0 + xji > 1 Vz e Mi, x l w-0 - zi < -1 Vz e Mi, 



y>0 f 



and z > 0 m2 . 



This can be converted to 



min (1 

w,6,y,z 



-X) 



XIV - 



i=l i=l 

+ yi>lVieMi, 



+ Xv s.t. 



XIV ■ 



nil/ 



z > 0 m2 and v > w\ Vz, v > —W[ Vz. 



(4.3) 



This is clearly a linear programming problem. In this formulation, A is a 'small' constant in (0, 1), 
much closer to 0 than it is to 1. Suppose that the original dataset is linearly separable, and let uf , 0* 
denote a solution to the optimization problem in (4.1), where \\w\\d replaces \\w\\. Then the choice 

w = w*, 0 = 0*, y = 0 mi and z = 0 m2 

is certainly feasible for the optimization problem (4.2). Moreover, if X is sufficiently small, any 
reduction in \\wWd achieved by violating the linear separation constraints (i.e. permitting some 
yi or zi to be positive rather than zero) is offset by the increase in the term (1 — A)||(y,z)||. It is 
therefore clear that, if the dataset is linearly separable, there exists a critical value Xq > 0 such that, 
for all X < Xq, the optimization problem (4.2) has (w*, 0, 0 mi ,0 m2 ) as a solution. On the other hand, 
the optimization problem (4.2) remains meaningful even when the data are not linearly separable. 

The final aspect of the problem, as suggested in Veropoulos et ah [50], is to introduce a trade- 
off between false positives and false negatives. In this connection, it is worthwhile to recall the 
definitions of the accuracy, etc. of a classifier. Given a discriminant function/(-), define 

C 1 :={ieM:f(x i )>0} and C 2 := {i e M :/(x z ) < 0}. 

Thus, C\ consists of the samples that are assigned to class 1 by the classifier, while C2 consists of 
the samples that are assigned to class 2. Then, this leads to the array shown below 



Mi 
M 2 



Ci 
TP 
FP 



C 2 
FN 
TN 



In the above array, the entries TP, FN, FP and TN stand for 'true positive', 'false negative', 'false 
positive' and 'true negative', respectively. 



Definition 4.1. With the above definitions, we have 

TP IC1n.M1i 



Se= ; 



Sp: 



TP + FN 
TN 



\M!\ ' 
\C 2 nM 2 \ 



and 



Ac= ■ 



FP + TN \M 2 \ 
TP + TN 



|CinyWi| + |C 2 nyw 2 | 



TP + TN + FP + FN \M!\ + \M 2 \ 

where Se, Sp and Ac stand for the sensitivity, specificity and accuracy, respectively. 



(4.4) 
(4.5) 
(4.6) 



All three quantities lie in the interval [0,1]. Moreover, accuracy is a convex combination of 
sensitivity and specificity. In particular, 



Ac = Se • 



\Mi\ 



\M X \ + \M 2 \ 



+ Sp- 



\M 2 \ 



\Mi\ + \M 2 \ 



Therefore, 



min{Se, Sp} < Ac < max{Se, Sp}. 



Also, the accuracy of a classifier will be roughly equal to the sensitivity if M\ is far larger than 
M 2 , and roughly equal to the specificity if M 2 is far larger than M\. 



In many classification problems, the consequences of misclassification are not symmetric. To 
capture these kinds of considerations, another parameter a e (0, 1) is introduced, and the objective 
function in the optimization problem (4.2) is modified by making the substitution 

+ J2 Zi *~ a J2y j + ( x _ a ) H Zi > 

i=l i=l i=l i=l 

where we adopt the computer science notation <<— to mean 'replaces'. If a = 0.5, then both false 
positives and false negatives are weighted equally. If a > 0.5, then there is greater emphasis 
on correctly classifying the vectors in M\, and the reverse if a < 0.5. With this final problem 
formulation, the following desirable properties result: 

— the problem is a linear programming problem and is therefore tractable even for 
extremely large values of n, the number of features; 

— the formulation can be applied without knowing beforehand whether or not the dataset 
is linearly separable; 

— the formulation provides for a trade-off between false positives and false negatives; and 

— most important, the optimal weight vector w has at most m non-zero entries, where m is 
the number of samples. Hence, the classifier uses at most m out of the n features. 

For these reasons, the i\-noxvcv SVM forms the starting point for our further research into 
classification. 

Until now, the discussion has been on two-class classification. The SVM framework does not 
extend to multi-class classification very readily. For one thing, when there are only two classes, 
it does not matter which class is labelled 'positive' and which is labelled 'negative'. On the other 
hand, when there are multiple classes, it is necessary to distinguish between two cases: in the 
first, there is a natural ordering among the class labels. For instance, if a patient's response is to 
be categorized as poor, fair, good, very good and excellent, the ordering is clear. In the second, 
there is no natural ordering, for example, if one wishes to assign a breast cancer tumour into one 
of the four subtypes mentioned above. The paper [51] is among the more popular methods for 
multi-class SVM. 

(c) Some applications of support vector machines to cancer 

In contrast with sparse regression, there are many applications of sparse classification methods to 
cancer biology. A search of the Pubmed database of the National Library of Medicine, USA, with 
the string 'SVM cancer' returns several hundred results. The vast majority of these papers present 
applications where human experts pare the hundreds or even thousands of measured features to 
a small subset, to which a standard £2 -norm SVM is applied. In other words, though the raw 
number of measured features is very high, the actual number of features used by the SVM is 
smaller than the number of samples. In principle, the ^i-norm SVM can be applied to the original 
feature set to choose the most predictive features out of the overall feature set. But for the most 
part, existing applications appear to leave the task of feature selection to human experts and not 
to the algorithm. The fraction of SVM applications that exploit the feature selection property of 
the ^i-norm SVM is quite small. Two reasons, not mutually exclusive, can be proposed to explain 
this. First, the users might simply be unfamiliar with the ^i-norm SVM methodology. Second, 
the users might believe that a purely data-driven approach might result in a feature set whose 
biological significance is unclear. 

The paper [48] that introduces the technique of RFE to the SVM algorithm studies the 
application of the technique to leukaemia. It is shown that the SVM-RFE method eventually leads 
to just two features being retained. However, several authors report that the performance of the 
SVM-RFE approach is in general somewhat erratic. It would appear therefore that the dataset 
studied in Guyon et ah [48] is particularly amenable to the use of this technique. An excellent 
review of several applications of SVMs to ovarian cancer can be found in Sabatier et al. [52]. 



Other examples of ovarian cancer applications include Han et ah [53], in which 322 samples are 
analysed to generate a 349-gene biomarker panel which performs very well, but when the 349 
genes are reduced to 18 genes the performance on the test data is poor; Denkert et ah [54], in 
which a 300-gene ovarian carcinoma index is constructed on the basis of 80 samples, which is 
then tested on 118 samples; and Hartmann et ah [55], in which a panel of 14 genes is identified to 
differentiate between early relapse and late-stage relapse. Yet, these papers ignore the result from 
Wenocur & Dudley [44], which states that if the number of features used is in excess of the number 
of samples in the training data, then generically an SVM can achieve 100% accuracy, sensitivity 
and specificity on the training data, irrespective of the assignment of labels to the samples. As 
neither Han et ah [53] nor Denkert et ah [54] reports such a phenomenon, it is unclear whether 
these papers have implemented the SVM algorithm accurately. 

As illustrations of applications to other forms of cancer, one can mention: Sabatier et ah [56], in 
which a 368-gene expression signature is trained on 2145 basal breast cancer samples, and then 
tested on another set of 2034 samples, with the aim of predicting the patient's prognosis; and 
Klement et ah [57], in which seven features were selected by human experts to study 399 NSCLC 
patients. As an example of using not just SVM but also RFE, one can cite Yang et ah [58], in which 
the efficacy of drug candidates against hepatocellular carcinoma (liver cancer) is studied. As the 
efficacy of drug candidates is actually a real number between 0 and 1, the authors discretize the 
efficacy into two bins: [0, 0.4] and [0.6, 1]. Apparently, their only motivation is to make the problem 
fit into the SVM framework. Therefore, it would be worthwhile to apply sparse regression 
techniques (as opposed to sparse classification) to such problems. Finally, an application of the 
multi-class SVM methodology of Crammer & Singer [51] is found in Huang et ah [59]. 



(d) The lone star algorithm 

As pointed out in §4, both the traditional i2-rvoxvcv SVM and the £i-norm SVM can be used for 
two-class classification problems. When the number of samples m is far larger than the number of 
features ft, the traditional SVM performs very satisfactorily, whereas the €i-norm SVM of Bradley 
& Mangasarian [49] is to be preferred when m < ft. Moreover, the i\-novvcY SVM is guaranteed 
to use no more than m features. However, in many biological applications, even m features are 
too many. Biological measurements suffer from poor repeatability. Therefore, a classifier that 
uses fewer features would be far preferable to one that uses more features. In this section, we 
present a new algorithm for two-class classification that often uses far fewer than m features, thus 
making it very suitable for biological applications. The algorithm combines the £-norm SVM of 
Bradley & Mangasarian [49], RFE of Guy on et ah [48] and stability selection of Meinshausen & 
Buhlmann [60]. A preliminary version of this algorithm was reported in Ahsen et ah [61]. Note 
that Li et ah [62] introduces an algorithm known as SVM-T-RFE that contains some similarities to 
this algorithm. 

The algorithm is as follows: 

(1) Choose at random a 'training set' of samples of size k\ from M\ and size ki from M2, 
such that k\ < raj/2 and k\,ki are roughly equal. Repeat this choice s times, where s is 
a Targe' number. This generates s different 'training sets', each of which consists of k\ 
samples from M\, 1 = 1,2. 

(2) For each randomly chosen training set, compute a corresponding optimal ^i-norm 
SVM using the formulation (4.2). This results in s different optimal weight vectors and 
thresholds. 

(3) Let k denote the average number of non-zero entries in the optimal weight vector across 
all randomized runs. Average all s optimal weight vectors and thresholds, retain the 
largest k components of the averaged weight vector and corresponding feature set, and 
set the remaining components to zero. This results in reducing the number of features 
from the original ft to k. 



(4) Repeat the process with the reduced feature set, but the originally chosen randomly 
selected training samples, until no further reduction is possible in the number of features. 
This determines the final set of features to be used. 

(5) Once the final feature set is determined, carry out twofold cross validation by dividing 
the data s times into a training set of k\,ki randomly selected samples and assessing 
the performance of the resulting ^i-norm classifier on the testing dataset, which is the 
remainder of the samples. Average the weights generated by the t < s best-performing 
classifiers, where t is chosen by the user, and call that the final classifier. 

When the number of features n is extremely large, an optional pre-processing step is to 
compute the mean value of each of the n features for each class, and retain only those features 
wherein the difference between means is statistically significant using the 'Student' Mest. Our 
experience is that using this optional pre-processing step does not change the final answer very 
much, but does decrease the CPU time substantially. Note that, in Li et al. [62], a weighted 
combination of the weight of the ^-norm SVM and the Mest statistic is used to eliminate features. 

Now some comments are in order regarding the above algorithm: 

— in some applications, M\ and M2 are of comparable size, so that the size of the training 
set can be chosen to equal roughly half of the total samples within each class. However, 
in other applications, the sizes of the two sets are dissimilar, in which case the larger set 
has far fewer of its samples used in training; 

— step 1 of randomly choosing s different training sets differs from Guyon et al. [48], where 
there is only one randomized division of the data into training and testing sets; 

— for each random choice of the training set, the number of non-zero entries in the optimal 
weight vector is more or less the same; however, the locations of non-zero entries in the 
optimal weight vector vary from one run to another; 

— in step 3 above, instead of averaging the optimal weights over all s runs and then retaining 
the k largest components, it is possible to adopt another strategy. Rank all n indices in 
order of the number of times that index has a non-zero weight in the s randomized runs, 
and retain the top k indices. In our experience, both approaches lead to virtually the same 
choice of the indices to be retained for the next iteration; 

— instead of choosing s randomized training sets right at the outset, it is possible to choose 
s randomized training sets each time the number of features is reduced; and 

— in the final step, there is no distinction between the training and testing datasets, so the 
final classifier is run on the entire dataset to arrive at the final accuracy, sensitivity and 
specificity figures. 

The advantage of the above approach vis-a-vis the l2-norm SVM-RFE of Guyon et al [48] 
or the SVM-T-RFE of Li et al. [62] is that the number of features reduces significantly at each 
step, and the algorithm converges in just a few steps. This is because, in the €i-norm SVM, many 
components of the weight vector are 'naturally' zero and need not be truncated. By contrast, in 
general all the components of the weight vector resulting from the £2 _n o rm SVM will be non- 
zero; as a result, the features can only be eliminated one at a time, and in general the number of 
iterations is equal to (or comparable to) n, the initial number of features. 

The new algorithm can be appropriately referred to as the '^i-SVM Mest and RFE' algorithm, 
where SVM and RFE are themselves acronyms as defined above. Once again taking the first 
letters, we are led to the 'second-level' acronym '^i-StaR', which can be pronounced as 'ell-one 
star'. Out of deference to our domicile, we have decided to call it the Tone star' algorithm. 

The lone star algorithm was applied to the problem of predicting which patients of endometrial 
cancer are at risk of lymph node metastasis. These results are reported elsewhere. But in brief, the 
situation is the following: the endometrium is the lining of the uterus. When a patient contracts 
endometrial cancer, her uterus, ovaries and fallopian tubes are surgically removed. One of the 
major risks run by endometrial cancer patients is that the cancer will metastasize and spread 



through the body via pelvic and/ or para-aortic lymph nodes. The Gynecologic Oncology Group 
recommends that the patient's pelvic and para-aortic lymph nodes should also be surgically 
removed when the size of the tumour exceeds 2 cm in diameter. However, post-surgery analysis 
reveals that, even in this case, lymphatic metastasis is present in only 22% of the cases [63]. 

To predict the possibility of lymphatic metastasis, 1428 miRNAs were extracted from 94 
tumours, half with and half without metastasis. Using the lone star algorithm, 13 miRNAs were 
identified as being highly predictive. When tested on the entire training sample of 94 tumours, the 
lone star classifier correctly classified 41 out of 43 lymph-positive samples, and 40 out of 43 lymph- 
negative samples. In ongoing work, these miRNAs were measured on an independent cohort of 
19 lymph-negative and nine lymph-positive tumours. The classifier classified eight out of nine 
lymph-positive tumours correctly, and 11 out of 19 lymph-negative tumours correctly. Thus, 
while the specificity is not very impressive, the sensitivity is extremely good, which is precisely 
what one wants in such a situation. Moreover, using a two-table contingency analysis and the 
Barnard exact test, the likelihood of arriving at this assignment by pure chance (the so-called 
pvalue) is bounded by 0.011574. In biology, any p-value less than 0.05 is generally considered to 
be significant. 

5. Some topics for further research 

Both machine learning and computational biology are vast subjects, and their intersection 
contains many more topics than are touched upon in this brief article. Besides, there are other 
topics in computational cancer biology that do not naturally belong to machine learning, for 
example modelling tumour growth using branching processes. Therefore, the emphasis in this 
article has been on topics that are well established in the machine learning community and are 
also relevant to problems in computational cancer biology. 

Until now several 'penalty' norms have been proposed for inducing an optimization algorithm 
to select structured sparse feature sets, such as GL and SGL. As pointed out in §2, available 
extensions of these penalty norms to overlapping sets do not address biological networks where 
there are multiple paths from a master regulator to a final node. Any advance in this direction 
would have an immediate application to computational biology. 

Compressed sensing theory as discussed in §3 is based on the premise that it is possible to choose 
the measurement matrix A. The available theorems in this theory are based on assumptions on 
the measurement matrix, such as the RIP, or the null space property, and perhaps something even 
more general in future. In order to apply techniques from compressed sensing theory to cancer 
biology, it would be necessary to modify the theory to the case where the measurement matrix is 
given, and not chosen by the user. The RIP corresponds to the assumption that in an m x n matrix 
A, every choice of k columns results in a nearly orthogonal set. In actual biological data, such 
an assumption has no hope of being true, because the expression levels of some genes would be 
highly correlated with those of other genes. In Candes & Plan [36], the authors suggest that it is 
possible to handle this situation by first clustering the column vectors and then choosing just one 
exemplar from each cluster before applying the theory. Our preliminary attempts to apply such 
an approach to ovarian cancer data [29] are not very promising, leading to RIP orders of 5 or 10 — 
far too small to be of practical use. Thus, there is a need for the development of other heuristics 
besides clustering to extract nearly orthogonal sets of columns for actual measurement matrices. 
In this connection, it is worth pointing out [64] that group RIP is easier to achieve using random 
projections, as compared with RIP. However, it is not clear whether a 'given' A matrix is likely to 
satisfy a group RIP with a sufficiently large order. 

In general, it would appear that sparse regression is more advanced than sparse classification, 
with both well-established theoretical foundations as well as widely used algorithms in the 
former. By contrast, sparse classification does not have such a wealth of results. The lone star 
algorithm introduced here has performed well in several applications involving cancer data, and, 
at least for the moment, it appears to be the only available method to select far fewer features 
than the size of the training set of samples. As of now, there is no theoretical justification for this 



observed behaviour. Recall that the £i-norm SVM is guaranteed only to choose no more features 
than the size of the training set; but there is no reason to assume that it will use fewer. Therefore, 
it is certainly worthwhile to study when and why lone star and other such algorithms will prove 
to be effective. 
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